


NPS 54-83-008 

NAVAL POSTGRADUATE SCHOOL 

Monterey, California 




Racial Bias and Predictive Validity 
in Testing for Selection 



R. A. Weitzman 



July, 1983 



Approved for public release; distribution unlimited



FEDDOCS 

D 208.14/2:NPS-54-83-008 



Prepared for:

Naval Postgraduate School
Monterey, CA 93940




NAVAL POSTGRADUATE SCHOOL 
Monterey, California 



Rear Admiral J. J. Ekelund 
Superintendent 



D. A. Schrady 
Provost 



This report received no direct research support. 



UNCLASSIFIED 



SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)



REPORT DOCUMENTATION PAGE 


READ INSTRUCTIONS 
BEFORE COMPLETING FORM 


1. REPORT NUMBER 

NPS 54-83-008 


2. GOVT ACCESSION NO. 


3. RECIPIENT'S CATALOG NUMBER


4. TITLE (and Subtitle)

Racial Bias and Predictive Validity in
Testing for Selection


5. TYPE OF REPORT & PERIOD COVERED

Technical, July 1983


6. PERFORMING ORG. REPORT NUMBER


7. AUTHOR(s)

R. A. Weitzman


8. CONTRACT OR GRANT NUMBER(s)


9. PERFORMING ORGANIZATION NAME AND ADDRESS 

Naval Postgraduate School 
Monterey, CA 93940 


10. PROGRAM ELEMENT, PROJECT, TASK
AREA & WORK UNIT NUMBERS


11. CONTROLLING OFFICE NAME AND ADDRESS

Naval Postgraduate School
Monterey, CA 93940


12. REPORT DATE

July 1983


13. NUMBER OF PAGES

37


14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)


15. SECURITY CLASS. (of this report)

UNCLASSIFIED


15a. DECLASSIFICATION/DOWNGRADING
SCHEDULE



16. DISTRIBUTION STATEMENT (of this Report)



Approved for public release; distribution unlimited.



17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)



18. SUPPLEMENTARY NOTES 



19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

Predictive Validity; Test Bias; Selection; Aptitude Tests; Scholastic 
Aptitude Test; Racial Bias 



20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

In contrast to the Cleary-McNemar view affirmed by Cole in the October 1981
issue of the American Psychologist on testing--"questions of bias are fundamen-
tally questions of validity"--this report shows that freedom from statutory
test bias, as interpreted by the courts, is different from predictive validity.
Use of a score-adjustment formula developed here to correct for statutory test
bias shows in typical cases not only that the correction tends only negligibly
to reduce predictive validity but also that the enhancement of predictive



DD FORM 1473, 1 JAN 73    EDITION OF 1 NOV 65 IS OBSOLETE    UNCLASSIFIED

S/N 0102-LF-014-6601



SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)



UNCLASSIFIED 

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)



Block 20 Continued 

validity without regard to statutory test bias can add a sizable 
criterion-independent decrement selectively to the already low test 
scores of low-scoring demographic groups. 



S/N 0102-LF-014-6601



SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)




Test Bias and Validity 
1 



Abstract 

In contrast to the Cleary-McNemar view affirmed by Cole in the 
October 1981 issue of the American Psychologist on testing-- 
"questions of bias are fundamentally questions of validity"-- 
this report shows that freedom from statutory test bias, as
interpreted by the courts, is different from predictive validity. 
Use of a score-adjustment formula developed here to correct for 
statutory test bias shows in typical cases not only that the 
correction tends only negligibly to reduce predictive validity 
but also that the enhancement of predictive validity without 
regard to statutory test bias can add a sizable criterion-inde- 
pendent decrement selectively to the already low test scores of 
low-scoring demographic groups. 




Racial Bias and Predictive Validity 
in Testing for Selection 

Widely perceived as gatekeepers of opportunity, tests have 
been a popular target of attack in the fight against racial dis- 
crimination (e.g., Notes 1 and 2). Mounting challenges have 
shaken test experts from the complacent position that tests are 
color-blind measuring instruments no more to blame for abnormally 
low scores than thermometers are for abnormally high temperatures
(Marston, 1971; Weitzman, 1972). Pioneered principally by Guion 
(1966) and Cleary (1968) , recent analyses have revealed occur- 
rences of putative discrimination in which tests often play a 
leading if unwitting and innocent role. Definitions of fairness 
in test use have not been uniform, a circumstance to which 
Flaugher (1978) has drawn particular attention, and treatments 
intended to assure one form of fairness would seem to work 
against another. Overcoming discrimination in test use, however, 
depends on the reconciliation of these differences. 

Development of expertise in a technical field like psycho- 
logical testing tends to produce increasing expectation and 
tolerance of complication. Whereas the public at large might 
condemn as biased a test on which white and black people have 
different means, a test expert is likely to consider this judg- 
ment to be premature. In the expert's view, more may be involved 
than simply a difference in test means. Particularly if the use 
of the test is to select applicants for work or school, the 
final verdict on test bias must also take into account subsequent 
performance on the job or in the classroom. If the racial group
having the higher test mean tends to perform correspondingly better
at work or school, then the difference in means may be a more
accurate reflection of test validity than of test bias.

Different from most definitions of test bias reported in the 
literature on the use of tests in selection, the definition adopted 
in this report contrasts test bias with attenuation of both 
predictive validity and fairness in test use. Whereas a valid 
test tends neither to under- nor to overestimate the ability of 
any racial group, a biased test, as defined here, tends to distinguish
between racial groups of equal ability. Because in selection the 
purpose of a test is prediction rather than measurement, "ability" 
in this definition refers not to a latent trait, like intelligence, 
but to the manifest criterion performance to be predicted. Defined 
in relation to validity, test bias is, like validity, a technical 
property of a test different from fairness, which is a property of 
test use. Even though a test is free from bias, therefore, a user 
of the test may still perceive a need for special selection proce- 
dures to assure its fair use. Jensen (1980) makes a corresponding 
distinction between freedom from test bias and fairness in test 
use, but for him test bias is differential validity, not a tendency 
to distinguish between racial groups of equal ability. Most other 
definitions of test bias in the literature turn out on consideration 
of the distinction between fair and unbiased testing actually to 
be definitions of fairness in test use. 

Procedures for assuring the fair use of specific tests require
the use of either multiple cutting scores or equivalent score
adjustments. Different definitions of fairness lead to correspondingly
different pairs of cutting scores. The next section presents a
critique of the use of multiple cutting scores. Proponents of one
definition of fairness tend to argue against others. Succeeding
sections consider some of the more critical of these arguments,
particularly in relation to the definition of test bias adopted here.
The final sections discuss the use of this definition to approximate
bias-free testing both with and without the use of score adjustments.

Test bias can result in the evaluation of the members of one 
group differently from the members of another group solely because 
of group membership. The groups may differ with respect to any of a 
number of demographic variables such as race and sex. To simplify 
the discussion, this report will refer only to the variable race. 
Everything said here, however, will apply equally well to other 
demographic variables that distinguish groups. 

The discussion will frequently involve correlations between 
race and predictor or criterion variables. Whether these correlations
are positive or negative depends on the mean scores of the two racial 
groups on these variables. In accordance with custom, a correlation 
will be positive if the mean score is higher and negative if the mean 
score is lower for the traditionally favored group. 

Selection, of course, involves both predictor and criterion 
variables. The problem of concern here is possible predictor bias. 
Criterion bias, also possible and certainly no less important (Flaugher,
1978; Green, 1975; Gulliksen, Note 3), is beyond the scope of this
report.

Multiple Cutting Scores 

Multiple-cutting-score definitions of fairness in test use
reflect a variety of standards of fairness. According to
Thorndike's (1971) definition, appropriately different cutting 
scores should yield selection proportions that match success pro- 
portions for the different groups. If X per cent of all appli- 
cants in Group A would be successful if selected, then the cut- 
ting score for Group A should yield the selection of X per cent 
of all Group A applicants. Cleary's (1968) definition likewise 
implies the use of different cutting scores if Group A and Group 
B applicants having the same predictor score have different mean 
criterion scores. The different cutting scores in this case 
should correspond to the criterion score separating success from 
failure. 
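
As a minimal sketch of the Cleary rule just described, the following Python fragment assumes a linear within-group regression of criterion on predictor. The intercepts, slopes, and success score are hypothetical values chosen only for illustration; none of them comes from this report, and the function name is likewise an assumption.

```python
def cleary_cutoff(intercept: float, slope: float, success_score: float) -> float:
    """Predictor score whose predicted criterion equals the score that
    separates success from failure, for one group's regression line
    criterion = intercept + slope * predictor."""
    if slope <= 0:
        raise ValueError("slope must be positive for a usable cutting score")
    return (success_score - intercept) / slope

# Hypothetical within-group regression lines (not data from the report):
# group name -> (intercept, slope).
groups = {"A": (0.2, 0.5), "B": (-0.2, 0.5)}
success_score = 0.5  # hypothetical criterion score separating success from failure

cutoffs = {name: cleary_cutoff(a, b, success_score)
           for name, (a, b) in groups.items()}
for name, cut in cutoffs.items():
    print(f"Group {name}: cutting score = {cut:.2f}")
```

With equal slopes, the group whose regression line has the lower intercept receives the higher cutting score, since a higher predictor score is needed to reach the same predicted criterion score.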

The Thorndike and Cleary proposals, contrasted at length 
by Schmidt and Hunter (1974), are only two among many. Linn 
(1973) and Cole (1973) proposed complementary standards of fair-
ness for different racial groups: equal proportions of successful
applicants among the selectees (Linn) and equal proportions
of selectees among the successful applicants (Cole). Resembling
Linn's, a standard proposed by Einhorn and Bass (1971) is that 
at their cutting scores members of the different racial groups 
have equal probabilities of success or failure (risk) . Two pro- 
posals that directly require only a single cutting score 
indirectly, through score adjustment, require one for each racial 
group. These are proposals by Darlington (1971) and McNemar 
(1975). Darlington, with only a single cutting score, achieves
the effect of multiple cutting scores by adjusting the obtained
criterion scores; McNemar, similarly, uses race as a predictor
along with the predictor test to form a multiple predictor--the
single multiple-predictor cutting score yields the same selection 
as the multiple cutting scores in the Cleary procedure (see Fig- 
ure 1). Except for the McNemar and the Cleary procedures, which 
both use race to maximize validity, all these different procedures 
involve different standards of fairness and yield correspondingly 
different cutting scores. 

These differences constitute a compelling argument against 
the possibility of distributional justice (unconditionally fair
reward). According to Kaufmann (1973), distributional justice is
impossible because of the multiplicity of dimensions of fairness.
A single judgment cannot be fair with respect to all dimensions.

Multiple cutting scores, of course, reflect a concern with 
validity and standards of fairness lacking in the simple use of 
quotas. As just noted, however, the results of this concern have 
been unfortunate. Except for the Cleary and McNemar standard,
which yields multiple cutting scores that generally favor white over
black applicants (Cleary, 1968; Temp, Note 4; Schmidt and Hunter, 1974),
all standards of fairness are inconsistent with maximal validity, as
all, with no exception, are inconsistent among themselves. The
effective difference between quotas and multiple cutting scores 
may thus be no more than that multiple cutting scores are less 
predictable than quotas in their impact on selection. 

[Figure 1: scatter diagram; vertical axis CRITERION.]

Figure 1. An applicant having the predictor score at the foot
of the vertical dashed line will have the same selection fate
through the use of either the Cleary or the McNemar procedure--
the applicant will be accepted if white (W), rejected if black (B).

Even proposals to achieve a social consensus regarding the
establishment of multiple cutting scores (e.g., Darlington,
1971; Gross and Su, 1975; and Petersen and Novick, 1976) must
fail, though not for technical reasons. Social consensus is
subject to test by the courts, and the courts have already cast a
shadow over the use of multiple cutting scores. The United States
Supreme Court in the pivotal Bakke case¹ rejected the implementa-
tion of racial quotas in college-admissions procedures. Succeed-
ing court rulings are likely to move even further in the same
direction, and the distance between quotas and multiple cutting
scores, as just noted, is not great. Subsequent to the Bakke case,
the 3rd District Court of Appeal in California, in fact, ruled
against the use of score adjustments to offset lower minority
grades or test scores.² An admissions officer who adds X points
to the score of a minority applicant or uses for the applicant 
a cutting score X points lower than for a non-minority applicant 
will, of course, arrive at the same selection result. The Cali- 
fornia decision thus extends the Bakke rejection of quotas to 
prohibit the use of multiple cutting scores in selection. The 
use of any multiplicity of cutting scores, however determined, 
is indeed interpretable as the implementation of quotas, for is
it not the intended effect of this use always to increase the 
admission proportion of one race relative to another? Even if 
the intent were separable from the effect, the use of multiple 
cutting scores would still replace arguable discrimination by 
unarguable reverse discrimination: the admission of black
applicants who score lower than unadmitted white applicants.
Test experts can argue that discrimination in one form or the
other might occur without the use of multiple cutting scores,
but this argument is likely to leave many non-experts uncon-
vinced. Multiple cutting scores would thus appear to constitute
a double standard ultimately justifiable only by the claim that
the utility of a college education differs for different racial
groups. People who are able to agree that a college education
has greater utility for one racial group than for another ought
also to be able to agree that the ownership of a new automobile
has the same relative utilities for the two racial groups. The
first agreement entails appropriately different cutting scores
for admission to college, however, only if the second entails
correspondingly different prices of new automobiles. Social con-
sensus cannot establish multiple cutting scores because--if for
no other reason--it points to the use of single ones.

¹ Regents of the University of California vs. Bakke, 98 S.Ct.
2733 (1978).

² DeRonde vs. Regents of the University of California, 101 Cal.
App. 3rd 191 (1980).

No imposed balance of the future against the past can be 
just. The past cannot be undone--people who suffered in the
past suffered no less even if their descendants fare better.
A person does not remove the bias from a coin that has turned
up five successive heads by assuring that the next five tosses
will produce tails. If each toss is bias-free, however, the
proportion of heads will tend to equal the proportion of tails
in the long run. The proposal here is thus not to attempt to
balance bias against counter-bias — discrimination against reverse 
discrimination — but rather to make each test use as nearly bias- 
free as possible. 

The Measurement of Test Bias 
A proper definition of test bias ought to imply actions 
that do not reduce discrimination at the expense of reverse
discrimination. Rather than providing guidance only for the
determination of multiple cutting scores or score adjustments, 
such a definition ought also to constitute the basis for measur- 
ing test bias. Armed with a measure of test bias, a test user 
can identify the test having the least bias among tests that are 
equally desirable in other respects. 

Test bias, as defined here, occurs when, and only when,
race (R) accounts for variation on the predictor (P) that has
no counterpart on the criterion (C). The occurrence of test bias
thus depends on the correlation between race and the component of
the predictor uncorrelated with the criterion. This is the part
correlation

$$ r_{R(P \cdot C)} = \frac{r_{RP} - r_{PC}\, r_{RC}}{\sqrt{1 - r_{PC}^{2}}} . \tag{1} $$

Since this correlation is zero when, and only when, the component
of P uncorrelated with C is also uncorrelated with R, the inequality

$$ r_{R(P \cdot C)} \neq 0 \tag{2} $$

must define test bias.
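
The part correlation of Equation 1 can be computed directly from the three zero-order correlations. The sketch below is only an illustration: the function name and the numerical correlations are hypothetical and do not come from this report.

```python
import math

def part_correlation(r_rp: float, r_rc: float, r_pc: float) -> float:
    """Part correlation r_R(P.C) of Equation 1: race with the component
    of the predictor that is uncorrelated with the criterion."""
    return (r_rp - r_pc * r_rc) / math.sqrt(1.0 - r_pc ** 2)

# Hypothetical correlations, chosen only for illustration.
r_rp, r_rc, r_pc = 0.30, 0.20, 0.50
bias = part_correlation(r_rp, r_rc, r_pc)
print(f"r_R(P.C) = {bias:.4f}")

# Under this definition a test is bias-free when r_RP equals r_PC * r_RC,
# because the numerator of Equation 1 then vanishes.
bias_free = part_correlation(r_pc * r_rc, r_rc, r_pc)
print(f"bias-free case: {bias_free:.4f}")
```

A positive result indicates bias in the sense of the text, a negative result counterbias, and zero a bias-free test.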

This definition of test bias conforms to the requirements
of the Civil Rights Act of 1964 as interpreted by the United
States Supreme Court in 1971: Either (a) the use of tests in
selection must have a numerically equal impact on minority and
non-minority applicant groups or (b) any numerical advantage of
one group over another must empirically reflect a corresponding
job-related advantage.³ The second of these conditions fails
technically whenever a nonzero correlation exists between race
and the component of the predictor that is not correlated with
the criterion (job), that is, whenever $r_{R(P \cdot C)} \neq 0$.

³ Griggs vs. Duke Power Co., 91 S.Ct. 849 (1971).

Not only does a nonzero value of $r_{R(P \cdot C)}$ define test bias,
but also any value of $r_{R(P \cdot C)}$ constitutes a measure of test
bias. Values of $r_{R(P \cdot C)}$ far from zero represent greater abso-
lute bias than values close to zero. Positive values of $r_{R(P \cdot C)}$
represent bias, negative values counterbias. A test for which
$r_{R(P \cdot C)} = 0$ is bias-free.

The definition $r_{R(P \cdot C)} \neq 0$ corresponds closely to a defini-
tion proposed by Darlington (1971):

$$ r_{RP \cdot C} \neq 0 . \tag{3} $$

Stating that at every value of the criterion the mean predictor
score tends to be larger for one race than for the other, this
definition seems to capture the essence of test bias--a uniform
tendency to score one race above the other where no criterion
difference exists. The partial correlation in this definition
is proportional to the part correlation $r_{R(P \cdot C)}$, with
$K = (1 - r_{RC}^{2})^{-1/2}$ as the constant of proportionality:

$$ r_{RP \cdot C} = K\, r_{R(P \cdot C)} . \tag{4} $$

Since K does not depend on the predictor, the inequality
$r_{RP \cdot C} \neq 0$ constitutes a definition of test bias equivalent to
$r_{R(P \cdot C)} \neq 0$. Citing the definition $r_{RP \cdot C} \neq 0$ as the third of
four possibilities, Darlington contrasted it with the first, also
involving a partial correlation:

$$ r_{RC \cdot P} \neq 0 , \tag{5} $$

which is, in fact, a formulaic version of the Cleary (1968) defi-
nition, whose use to determine multiple cutting scores was noted
earlier. The differences between these two partial-correlation
definitions support the adoption of $r_{RP \cdot C} \neq 0$ (or, equivalently,
$r_{R(P \cdot C)} \neq 0$) as the proper definition of test bias. The next
section examines these differences.
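
The proportionality in Equation 4 can be checked numerically. The sketch below uses the standard first-order partial-correlation formula; the function names and the correlations are hypothetical, chosen only for illustration.

```python
import math

def part_correlation(r_rp: float, r_rc: float, r_pc: float) -> float:
    """Part correlation r_R(P.C) of Equation 1."""
    return (r_rp - r_pc * r_rc) / math.sqrt(1.0 - r_pc ** 2)

def partial_rp_given_c(r_rp: float, r_rc: float, r_pc: float) -> float:
    """Partial correlation r_RP.C: race with predictor, criterion held constant."""
    return (r_rp - r_pc * r_rc) / math.sqrt((1.0 - r_rc ** 2) * (1.0 - r_pc ** 2))

# Hypothetical correlations, chosen only for illustration.
r_rp, r_rc, r_pc = 0.30, 0.20, 0.50

k = (1.0 - r_rc ** 2) ** -0.5           # constant of proportionality K
lhs = partial_rp_given_c(r_rp, r_rc, r_pc)
rhs = k * part_correlation(r_rp, r_rc, r_pc)
print(f"r_RP.C = {lhs:.6f}, K * r_R(P.C) = {rhs:.6f}")
```

Because K involves only the race-criterion correlation, comparing tests by either coefficient ranks them in the same order for a fixed criterion.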

Partial-correlation Definitions 
One argument against the use of multiple cutting scores
other than Cleary's is that their use attenuates validity (Dar-
lington, 1976). Cleary's exception makes comparison of $r_{RC \cdot P} \neq 0$
with $r_{RP \cdot C} \neq 0$ especially important. To facilitate discussion,
$r_{RC \cdot P} \neq 0$ will be called the Cleary definition and $r_{RP \cdot C} \neq 0$
the Darlington 3 definition of test bias. Both involve partial
correlations in identical relationships. Their formal resem-
blance, however, belies fundamental differences. As Hunter and
Schmidt (1976) have pointed out, the two definitions are incon-
sistent. Since $r_{RP \cdot C}$ is proportional to $r_{RP} - r_{PC} r_{RC}$ and
$r_{RC \cdot P}$ is proportional to $r_{RC} - r_{PC} r_{RP}$, both cannot be equal to
zero at the same time unless $r_{PC} = 1$, which is unlikely to the
point of impossibility (no test has perfect validity). This
inconsistency has implications for both the definition and the
measurement of test bias: If $r_{RP \cdot C} \neq 0$ defines test bias,
$r_{RC \cdot P} \neq 0$ cannot; if $r_{RP \cdot C}$ measures test bias, $r_{RC \cdot P}$ must
measure something else.
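
A small numerical sketch may make the inconsistency concrete. It forces the Darlington 3 partial correlation to zero and then evaluates the Cleary partial correlation for several validities. The race-criterion correlation and the validities are hypothetical, and a nonzero race-criterion correlation is assumed (with no racial difference on either variable, both partial correlations would trivially be zero).

```python
import math

def partial(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))

r_rc = 0.20  # hypothetical race-criterion correlation
rows = []
for r_pc in (0.3, 0.5, 0.7, 0.9):           # hypothetical validities
    r_rp = r_pc * r_rc                       # makes r_RP.C exactly zero
    darlington3 = partial(r_rp, r_rc, r_pc)  # r_RP.C (C held constant)
    cleary = partial(r_rc, r_rp, r_pc)       # r_RC.P (P held constant)
    rows.append((r_pc, darlington3, cleary))
    print(f"r_PC = {r_pc:.1f}: r_RP.C = {darlington3:.4f}, r_RC.P = {cleary:.4f}")
```

In this sketch the Cleary coefficient stays away from zero at every validity below one and shrinks only as the validity approaches one.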

A scatterplot showing the relationship between standardized
predictor and criterion measurements provides a geo-
metric representation of this inconsistency (see Figure 2). If
$0 < r_{PC} < 1$, the regression lines for P on C and C on P cross at
the origin. In the upper-right quadrant (shown), the P-on-C line
is above the C-on-P line. In the case of no racial differences,
the population means of Groups A and B are equal on each variable.
To satisfy the $r_{RP \cdot C} = 0$ (Darlington 3) condition for a bias-
free test, racial differences must correspond to the separation
of Group A and Group B points parallel to the P-on-C line. The
population means for the two groups will now differ on each vari-
able, but not on P for each sub-population having the same
value of C. Satisfaction of the $r_{RC \cdot P} = 0$ (Cleary) condition
for a bias-free test corresponds to the separation of Group A
and Group B points parallel to the C-on-P line. Since when
$r_{PC} < 1$ this line is not parallel to the P-on-C line, satisfac-
tion of the two conditions cannot occur simultaneously unless
$r_{PC} = 1$.

Figure 2. Separation of racial (R) groups, A and B, with respect
to two regression lines describing the relationship between
standardized predictor (P) and criterion (C) variables when
$0 < r_{PC} < 1$. The separation is parallel to the P-on-C line for
a bias-free test ($r_{RP \cdot C} = 0$) and to the C-on-P line for a test that
satisfies the Cleary condition ($r_{RC \cdot P} = 0$).
The inconsistency just demonstrated means that any procedure
that moves one of the two partial correlations toward zero must
simultaneously move the other one away from zero to the extent
that $r_{PC}^{2}$ differs from one. Procedures that move the Cleary
partial correlation toward zero enhance validity (Darlington,
1976). Validity must thus, to some extent, be a casualty of the
adoption of Darlington 3 over Cleary.

Technical grounds for deciding in favor of Darlington 3
over Cleary do exist, however. In contrasting the $r_{RC \cdot P} \neq 0$
and the $r_{RP \cdot C} \neq 0$ definitions of test bias, Darlington (1971)
showed that the race-predictor correlation is considerably
greater when $r_{RC \cdot P} = 0$ than when $r_{RP \cdot C} = 0$, particularly for
low test validities. This difference in $r_{RP}$ values is due to
the correlation between race and the component of the predictor
uncorrelated with the criterion ($r_{R(P \cdot C)}$), which satisfaction of
the Cleary equality ($r_{RC \cdot P} = 0$) tends to inflate along with $r_{RP \cdot C}$.
Attempts to satisfy the Cleary equality thus work to increase the
race-predictor correlation inordinately in the process of
extracting the entire potential contribution of race to validity.
Indeed, if the validity achieved in this process is equal to the
race-criterion correlation, whatever its value, then, because
$r_{RC}$ must be equal to $r_{PC} r_{RP}$ to satisfy the Cleary equality, the
race-predictor correlation will swell to one! The Cleary
procedure and its McNemar equivalent have no safeguard to
prevent the predictor from becoming race itself ("white, you're
in; black, you're out").
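
The contrast between the two equalities can be tabulated from the relations they imply: $r_{RP} = r_{RC} / r_{PC}$ when the Cleary equality holds and $r_{RP} = r_{PC} r_{RC}$ when the Darlington 3 equality holds. The sketch below is an illustration only; the function names and the correlations are hypothetical.

```python
def r_rp_if_cleary_zero(r_rc: float, r_pc: float) -> float:
    """Race-predictor correlation implied by r_RC.P = 0 (r_RC = r_PC * r_RP)."""
    if r_pc == 0 or abs(r_rc) > abs(r_pc):
        raise ValueError("no admissible r_RP: need r_PC != 0 and |r_RC| <= |r_PC|")
    return r_rc / r_pc

def r_rp_if_darlington3_zero(r_rc: float, r_pc: float) -> float:
    """Race-predictor correlation implied by r_RP.C = 0 (r_RP = r_PC * r_RC)."""
    return r_rc * r_pc

r_rc = 0.20  # hypothetical race-criterion correlation
table = []
for r_pc in (0.8, 0.5, 0.3, 0.2):  # hypothetical validities, high to low
    cleary = r_rp_if_cleary_zero(r_rc, r_pc)
    darlington3 = r_rp_if_darlington3_zero(r_rc, r_pc)
    table.append((r_pc, cleary, darlington3))
    print(f"r_PC = {r_pc:.1f}: r_RP = {cleary:.3f} (Cleary), "
          f"{darlington3:.3f} (Darlington 3)")
```

In the last row the validity equals the race-criterion correlation, and the race-predictor correlation implied by the Cleary equality reaches one, which is the limiting case described in the text.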

The technical decision between the $r_{RC \cdot P}$ and $r_{RP \cdot C}$ defi-
nitions of test bias is reducible to a more publicly accessible
decision. Only when $r_{RC \cdot P}$ is equal to zero does race affect
the criterion through the predictor alone. This is certainly an
advantage of $r_{RC \cdot P}$. What is wrong, however, if race has an
effect on the criterion other than through the predictor? Some
(but not all) validity due to race may be lost, but a zero value
of $r_{RC \cdot P}$ permits a perhaps greater danger: Race may have an
effect--a possibly large effect--on the predictor other than
through the criterion. This effect is bias, which only a zero
value of $r_{RP \cdot C}$ can eliminate. The decision between $r_{RC \cdot P}$ and
$r_{RP \cdot C}$ reduces to the decision between maximizing validity
and minimizing bias.

Strong support thus exists for the adoption of the Darling- 
ton 3 in preference to the Cleary definition of test bias. Read- 
ers familiar with published critiques that appear to challenge 
the Darlington 3 definition may not yet feel comfortable with 
this preference, however. The need now, then, is to consider 
these critiques. 



The Hunter-Schmidt Critique 

In reference to a predictor correlated with only one factor of
a two-factor criterion, Hunter and Schmidt (1976) assumed this
factor to be related to race and considered the two cases in which
the other factor either was or was not also related to race. The
first case becomes a problem for Darlington 3, according to Hunter
and Schmidt, if in selection for college the two factors are un-
correlated and the predictor is a pure measure of academic ability.
In this case, which fails to satisfy the Cleary equality, Hunter
and Schmidt observed that simply a failure also to satisfy the
Darlington 3 equality could attach to a perfect predictor the
opprobrium of bias. Failure to satisfy the Cleary equality alone,
however, could attach to this predictor not only the same opprobrium
but also the opprobrium of attenuated predictive validity. Since
the purpose of a predictor is to predict rather than to measure, in
fact, a predictor cannot be perfect for its purpose if the criterion
predicted is an impure measure of what the predictor is a pure mea-
sure. The validity that counts in selection is predictive, not
construct, validity. Though pure, therefore, the predictor in this
case is not perfect. In the second case, which satisfies the
Cleary but not the Darlington 3 equality, Hunter and Schmidt argued
that the choice of a predictor correlated with only the first (racial)
factor may reflect merely ignorance of the second, not intentional
bias. Effect, rather than intention, is the critical concern, how-
ever, and solely the attempt to maximize predictive validity without
any intention to create bias may produce this case. Apart from these
two cases, Hunter and Schmidt wondered how a criterion, as the con-
trol variable in Darlington 3, could cause race to covary with a
predictor that necessarily preceded it in time. The use of Darling-
ton 3 to indicate test bias makes no reference to causation, however.
Any predictor, whatever it measures or fails to measure and whatever
the intention that it do so, thus does indeed deserve the opprobrium
of bias if, as indicated by Darlington 3, variation on it contains
a component due to race that is absent from variation on the cri-
terion.

The Petersen-Novick Critique 

In a nice argument against the possibility of distributional 
justice, Petersen and Novick (1976) rejected the standards of 
fairness proposed by Thorndike (1971), Linn (1973), and Cole
(1973) for determining multiple cutting scores. (In the case of
Thorndike, the argument actually applies to an extended version 
requiring equal ratios — not necessarily 1:1 — of selection to 
success proportions for the different groups.) Each of these 
standards has an equally justifiable converse. No more or less 
justifiable than the "extended Thorndike" requirement that the 
proportion selected be the same fraction of the proportion suc-
cessful for every group, for example, is the converse require-
ment involving the proportion rejected and the proportion
unsuccessful. The cutting scores determined by the application
of a standard and its converse are generally different, however. 

The Thorndike, Linn, and Cole standards of fairness are thus 
internally inconsistent. 

A casual review of the literature suggests that the
Petersen-Novick argument might also apply to the Darlington 3
definition of test bias. Since a test for which $r_{RP \cdot C} = 0$ may
not require the use of multiple cutting scores to meet Cole's
standard of fairness, Cole (1973) associated her definition of
"test bias" with Darlington's $r_{RP \cdot C} \neq 0$. Both Hunter and
Schmidt (1976) and Petersen and Novick (1976) acknowledged this
association without examining it further. Further examination,
however, shows that the Darlington 3 definition lies outside the
purview of the Petersen-Novick argument. Only in the case of a
binary criterion that dichotomizes the population into a subpopu-
lation of successes and a subpopulation of failures may the
Darlington 3 condition $r_{RP \cdot C} = 0$ generally obviate the need for
multiple cutting scores to meet Cole's standard of fairness, and
the Petersen-Novick argument does not extend to this case. Under
the conditions of predictor normality and homoscedasticity for
the two racial groups within each criterion-defined subpopulation,
in fact, the equality of predictor means implied by the Darlington 3
condition $r_{RP \cdot C} = 0$ in turn implies the simultaneous satisfaction
of both the Cole standard of fairness and its converse.

The Darlington Critiques 

Although Darlington (1971) proposed the equality $r_{RP \cdot C} = 0$
as a possible definition of "culture fairness," he did not him-
self endorse this proposal. In the same 1971 article and again
later (Darlington, 1976) , he criticized all definitions involv- 
ing formulas as too mechanical. Darlington's own preference 
was for judgmental methods that reflect the relative importance 
of criterion performance and reverse discrimination. The prin- 
ciple underlying this preference is that some nonzero amount of 
reverse discrimination is desirable. The arguments made earlier 
regarding social consensus apply here. No principle favoring 
reverse discrimination is likely to prevail in a democratic 
society against the principle that the only desirable discrimi- 
nation is zero discrimination. 

Darlington (1976) extended his criticism of formulaic defi- 
nitions of "culture fairness" to include possible conflict with 
validity, mixture of technical (psychometric) and political 
arguments, insufficient and low-quality selection of minorities 
from applicant pools having poor minority representation, and 
the possible arbitrariness of unfavored-group identification. 

All but the first of these criticisms apply potentially to any 
procedure designed to remedy test bias or discrimination in test 
use. As long as both validity and fair test use are desirable, 
for example, no procedure can be free of either technical or 
political arguments. The first criticism, moreover, simply 
reflects reality: Validity and fair test use are, by most stan-
dards of fairness, conflicting objectives. Validity will,
indeed, be a casualty of all procedures, including those
endorsed by Darlington, that yield selection results different
from the results yielded by the Cleary or McNemar procedures.

The apparent validity loss in one example cited by Darlington 
is serious enough to require special comment, however. In this 
example, based on empirical data, a white applicant whose pre- 
dicted criterion percentile is 50 would have the same selection 
fate as a black applicant whose corresponding percentile is only 
1. Darlington indicated neither how he used Darlington 3 to 
obtain this result nor what the percentile difference might be 
before its use. Percentile differences, in any case, are not 
validity coefficients. A proper evaluation of validity loss due 
to the use of any procedure would have not only to compare valid- 
ity coefficients determined both before and after the use of the 
procedure but also to include corresponding results of rival pro- 
cedures in the comparison. One commendable procedure, particu- 
larly, is to use neither multiple cutting scores nor differential 
score adjustment but the most valid available bias-free test. 
Regardless of the procedure used, however, test users must be 
prepared to sacrifice incremental validity bought at the expense 
of test bias or discrimination. 

Correction for Test Bias 

Test bias as defined by Cleary (1968) is correctable by the 
use of either the test (predictor) alone with multiple cutting 
scores (Cleary, 1968) or the test-race multiple predictor with 
a single cutting score (McNemar, 1975). The first of these 
options cannot work directly to make $r_{RP \cdot C}$ equal zero because
two predictor scores, while possibly corresponding by regression
to the same criterion score (Cleary's procedure), cannot correspond
by regression to the same predictor score. A form of score
adjustment, like the second (McNemar) option, can work to make
$r_{RP^* \cdot C}$ zero, however. The adjustment

$$ P^* = Z_P + \beta Z_R , \tag{6} $$

where $Z_P$ is the standardized predictor and $Z_R$ the standardized
racial measurement and

$$ \beta = -\frac{r_{RP} - r_{RC} r_{PC}}{1 - r_{RC}^2} , \tag{7} $$

will, in fact, make $r_{RP^* \cdot C}$ zero (see Appendix A for complete
derivation). Though analogous to McNemar's multiple predictor,

$$ P^{\dagger} = \beta_P Z_P + \beta_R Z_R , \tag{8} $$

this adjustment has the effect of minimizing test bias rather
than maximizing test validity.

The next section illustrates the effect of the adjustment P* 
on test validity. 
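
As a rough numerical check on Equations (6) and (7), the following Python sketch computes the weight β from the correlations used in Table 1 (validity .41, race-predictor correlation .40, race-criterion correlation .16) and confirms that the numerator of the partial correlation between race and the adjusted predictor vanishes. The function names are illustrative assumptions and do not come from the report.

```python
def beta_weight(r_rp: float, r_rc: float, r_pc: float) -> float:
    """Equation (7): weight on standardized race that zeroes r_RP*.C."""
    return -(r_rp - r_rc * r_pc) / (1.0 - r_rc ** 2)


def partial_numerator_after_adjustment(r_rp, r_rc, r_pc, beta):
    """Numerator of r_RP*.C, up to the common positive divisor S_P*.

    cov(Z_R, P*) = r_RP + beta and cov(Z_C, P*) = r_PC + beta * r_RC,
    so the numerator is proportional to the difference returned here.
    """
    return (r_rp + beta) - r_rc * (r_pc + beta * r_rc)


r_pc, r_rp, r_rc = 0.41, 0.40, 0.16      # values quoted for Table 1
beta = beta_weight(r_rp, r_rc, r_pc)
residual = partial_numerator_after_adjustment(r_rp, r_rc, r_pc, beta)

print(round(beta, 3))          # -0.343
print(abs(residual) < 1e-12)   # True
```

The weight is negative because the race-predictor correlation exceeds the product of the other two correlations, so the adjustment subtracts part of the standardized racial measurement from the predictor.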

Trade-off between Test Bias and Validity 

A trade-off exists between test bias and validity so that 
the elimination of bias creates a reduction in validity. An 
example will illustrate this trade-off. Table 1 (left column) 
describes a test having a validity ($r_{PC}$) of .41 maximized by race
with a race-predictor correlation ($r_{RP}$) of .40, corresponding to
a separation of 1.25 standard-deviation units ($S_P$) between black
($\bar{P}_B$) and nonblack ($\bar{P}_W$) means on the predictor:

$$ r_{RP} = \frac{\bar{P}_W - \bar{P}_B}{S_P} \sqrt{p(1-p)} , \tag{9} $$

where p = .112 (proportion of black people in the population).
The remaining entries in Table 1 are determinable from these
values of $r_{PC}$ and $r_{RP}$. Since $r_{RC \cdot P} = 0$ and $r_{RP^* \cdot C} = 0$,
the numerators in their formulas are also equal to zero; therefore,

$$ r_{RC} = r_{RP} r_{PC} \tag{10} $$

and

$$ r_{RP^*} = r_{RC} r_{P^*C} , \tag{11} $$

where

$$ r_{P^*C} = \frac{r_{PC} + \beta r_{RC}}{\sqrt{1 + 2\beta r_{RP} + \beta^2}} \tag{12} $$

(see Appendix B for complete derivation). The test is biased:
$r_{R(P \cdot C)}$ = .37. The entries in the leftmost two columns of Table
1 to be compared are the validities $r_{PC}$ and $r_{P^*C}$ and the
race-predictor correlations $r_{RP}$ and $r_{RP^*}$. The differences
between $r_{RP}$ and $r_{RP^*}$ and between $r_{PC}$ and $r_{P^*C}$ in
Table 1 indicate that elimination of test bias results in a
large reduction in the race-predictor correlation but only a
small reduction in validity. Whereas $r_{RP}^2$ is .16 ($r_{RP}$ = .40)
with a maximal validity of .41 for the unadjusted predictor (left
column), adjustment of the predictor to eliminate bias reduces
the square of the race-predictor correlation virtually to
zero ($r_{RP^*}$ = .06) while reducing the maximal validity by only
.02 to .39 (middle column). The trade-off thus involves a much
greater change in bias, reflected in the relative $r_{RP}^2$ and
$r_{RP^*}^2$ values, than validity.

Table 1

Trade-off between Bias and Validity when $r_{RC \cdot P} = 0$

Test data                Adjustment results          Adjustment
($r_{RC \cdot P} = 0$)     ($r_{RP^* \cdot C} = 0$)      ($\beta = -.343$; $P^* = Z_P + \beta Z_R$)

$r_{PC} = .41$           $r_{P^*C} = .39$            $\beta Z_W = -0.12$
$r_{RP} = .40$           $r_{RP^*} = .06$            $\beta Z_B = +0.95$

Note. $Z_B$ (-2.775) is the standardized black and $Z_W$ (+0.35)
the standardized nonblack racial measurement.

The amount of adjustment from P to P*, however, was not
small: an increase by almost one standard-deviation unit of each
black score coupled with a modest decrease of each white score
(right column of Table 1). The direction, no less than the amount
of adjustment, is notable. Unlike the McNemar (Cleary)
adjustment, which would generally favor white applicants, this
adjustment favors black applicants. The amount and direction of
adjustment together reflect substantial bias in the unadjusted 
predictor against black applicants. 
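
The Table 1 entries discussed above can be reproduced with a short Python sketch. It assumes that Equation (12) gives the validity of the adjusted predictor, that the race-criterion correlation is the rounded product .40 x .41 = .16, and that the bias index r_R(P.C) is the standard semipartial correlation; the function names are illustrative only. The standardized race values +0.35 and -2.775 are those quoted in the note to Table 1.

```python
import math


def adjusted_validity(r_pc, r_rc, r_rp, beta):
    """Equation (12): validity of the adjusted predictor P*."""
    return (r_pc + beta * r_rc) / math.sqrt(1.0 + 2.0 * beta * r_rp + beta ** 2)


def semipartial_bias(r_rp, r_rc, r_pc):
    """Correlation of race with the part of P uncorrelated with C (assumed form)."""
    return (r_rp - r_rc * r_pc) / math.sqrt(1.0 - r_pc ** 2)


r_pc, r_rp = 0.41, 0.40
r_rc = round(r_rp * r_pc, 2)                        # Equation (10), rounded to .16
beta = -(r_rp - r_rc * r_pc) / (1.0 - r_rc ** 2)    # Equation (7)
r_pstar_c = adjusted_validity(r_pc, r_rc, r_rp, beta)
r_r_pstar = r_rc * r_pstar_c                        # Equation (11)
bias = semipartial_bias(r_rp, r_rc, r_pc)

print(round(r_pstar_c, 2), round(r_r_pstar, 2), round(bias, 2))   # 0.39 0.06 0.37
# Score shifts for nonblack and black applicants, in standard-deviation units
print(round(beta * 0.35, 2), round(beta * -2.775, 2))             # -0.12 0.95
```

The output matches the pattern described in the text: validity drops from .41 to about .39, while the race-predictor correlation drops from .40 to about .06.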

This bias is disturbing because the example cited may not be 
atypical. Predictors are common on which white means exceed 
black means by more or less 1.25 standard-deviation units to
produce $r_{RP}$ values of around .40 (see Table 2, based on data
reported by Temp, Note 4). The median validity of the Scholastic
Aptitude Test is .41 (Note 5, p. 16). Corresponding through
Equation (9) to the white-black mean difference of .5
standard-deviation units typical of a performance measure (Hunter,
Schmidt, and Rauschenberger, 1977, p. 249), the race-criterion
correlation of .16 is the correlation with race that a criterion
would have if .41 were the validity maximized by race with an $r_{RP}$
value of .40. Use of race to maximize validity can thus, by its
effect on the race-predictor correlation, often produce 
considerable test bias. 
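
The conversions from mean differences to correlations in this paragraph can be sketched as follows, assuming Equation (9) is the usual point-biserial relation between a dichotomy and a continuous score (correlation = standardized mean difference times the square root of p(1 - p)). The function name is an illustrative assumption.

```python
import math


def point_biserial(mean_diff_sd: float, p: float) -> float:
    """Correlation of a dichotomy with a continuous score.

    mean_diff_sd is the group mean difference in total-group
    standard-deviation units; p is the proportion in one group.
    """
    return mean_diff_sd * math.sqrt(p * (1.0 - p))


p = 0.112                           # proportion of black people in the population
r_rp = point_biserial(1.25, p)      # predictor: 1.25 SD separation
r_rc = point_biserial(0.50, p)      # criterion: 0.5 SD separation

print(round(r_rp, 2), round(r_rc, 2))   # 0.39 0.16
```

The first value, about .39, is the figure the text rounds to "around .40"; the second is the race-criterion correlation of .16.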



Table 2

White-minus-black SAT Mean Differences in Standard-deviation Units

                     Verbal                       Quantitative
School     Difference   Standard Dev.     Difference   Standard Dev.

   1          1.11           81              1.34           79
   2          1.14           81              1.05           93
   3          1.38           77              1.75           81
   4          1.46           72              1.56           89
   5          1.26          109              1.37          108
   6          1.34           73              1.10           91
   7          1.56          101              1.50          107
   8          1.09           94              1.16           85
   9          1.43           96              1.27           99
  10          1.61           90              1.91           30
  11          0.80           89              0.96           85
  12          0.83           80              0.77           75
  13          0.66           91              1.19           91

Note. Temp (Note 4, p. 10) reported means and standard deviations
for the two racial groups separately. Use of the formulas
$\bar{P} = p\bar{P}_B + (1-p)\bar{P}_W$ and
$S_P^2 = pS_B^2 + (1-p)S_W^2 + p(\bar{P}_B - \bar{P})^2 + (1-p)(\bar{P}_W - \bar{P})^2$,
with p = .112, provided the total-group
information needed to compute the entries in this table.
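
The pooling formulas in the note can be sketched in Python as follows. The subgroup means and standard deviations below are hypothetical values chosen only to illustrate the arithmetic; they are not taken from Temp's report, and the function name is an assumption.

```python
import math


def total_group_stats(mean_b, sd_b, mean_w, sd_w, p=0.112):
    """Combine two subgroup means and SDs into total-group values.

    p is the proportion in the first (black) group, as in the table note.
    """
    mean = p * mean_b + (1.0 - p) * mean_w
    var = (p * sd_b ** 2 + (1.0 - p) * sd_w ** 2
           + p * (mean_b - mean) ** 2 + (1.0 - p) * (mean_w - mean) ** 2)
    return mean, math.sqrt(var)


# Hypothetical subgroup values, not taken from Temp's report
mean, sd = total_group_stats(mean_b=400.0, sd_b=80.0, mean_w=500.0, sd_w=80.0)
diff_in_sd_units = (500.0 - 400.0) / sd

print(round(mean, 1), round(sd, 1), round(diff_in_sd_units, 2))   # 488.8 86.0 1.16
```

Because the total-group standard deviation includes the between-group spread, it exceeds the common within-group value of 80 in this example, and the mean difference expressed in total-group units is correspondingly smaller than 100/80.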



The use of race to enhance validity need not be overt or even
intentional. Nor does this example mean that the SAT is racially
biased: the combination of typical values of $r_{PC}$, $r_{RC}$, and $r_{RP}$ may
be quite rare, and all these values may be quite different in
populations of randomly accepted applicants. Because in $r_{RP \cdot C}$
conditioning is on the criterion, use of values of $r_{PC}$ and $r_{RC}$
corrected for attenuation due to criterion unreliability may also
be more appropriate than use of the observed values.
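
A minimal sketch of such a correction, assuming the classical formula that divides an observed correlation by the square root of the criterion reliability, is given below. The reliability of .64 is a hypothetical value chosen only for illustration; the report does not supply one. The race-predictor correlation is left unchanged because it does not involve the criterion.

```python
import math


def correct_for_criterion_unreliability(r_xc: float, criterion_reliability: float) -> float:
    """Classical correction: divide by the square root of the criterion reliability."""
    return r_xc / math.sqrt(criterion_reliability)


r_cc = 0.64   # hypothetical criterion reliability, for illustration only
r_rp = 0.40   # unaffected by criterion unreliability
r_pc_true = correct_for_criterion_unreliability(0.41, r_cc)
r_rc_true = correct_for_criterion_unreliability(0.16, r_cc)

# Equation (7) recomputed with the corrected correlations
beta_corrected = -(r_rp - r_rc_true * r_pc_true) / (1.0 - r_rc_true ** 2)

print(round(r_pc_true, 4), round(r_rc_true, 4))   # 0.5125 0.2
print(round(beta_corrected, 2))                   # -0.31
```

Under this assumed reliability the adjustment weight shrinks in magnitude from about -.34 to about -.31, which illustrates why the choice between observed and corrected values matters.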

Investigation of the possibility of bias for the SAT and other 
standardized tests is important, however. What is particularly
disturbing is that race may frequently in the past have
contributed, inadvertently in the process of test construction,
substantially to test bias while contributing only modestly to
validity.

Recapitulation 

Pursuing the distinction between test bias and invalidity 
examined earlier, the previous two sections presented a correction 
for test bias and used this correction to examine the effects of 
both bias reduction on validity and validity enhancement on bias. 

The use of race to enhance validity can produce a race-predictor 

correlation that is not only much higher than the race-criterion 
correlation but also more or less as high as the predictor-criterion 
correlation (see Table 1). These relationships reflect test bias:
a nonzero correlation between race and the component of the predictor
uncorrelated with the criterion. Elimination of bias reduces
the race-predictor correlation substantially while reducing validity
only slightly. Corresponding to the positive race-criterion correlation,
a positive race-predictor correlation still exists after the
elimination of bias. This correlation, however, separates values 
indicative of bias above it from values indicative of counter-bias 
below. 



Test Choice versus Score Adjustment 
If attempts to maximize validity tend to produce test bias, 
what objective other than maximal validity should a test developer 
or test user pursue? Which of these two, moreover, should be the
more responsible for the pursuit of this objective? These ques- 
tions arise from the distinction between bias and invalidity, 
and the first question particularly constitutes a problem that 
does not exist in the Cleary-McNemar view of test bias. Just as 
attempts to increase validity tend also to increase reliability, 
so in the Cleary-McNemar view do these attempts tend to decrease 
test bias. The problem created by distinguishing between test 
bias and invalidity is a problem of multiple objectives. 

Resolution of this problem is possible by recasting it so 
that one objective becomes a constraint while the other remains 
as the sole objective. In the trade-off between the two objectives,
a large change in bias corresponds to a small change in
validity. This imbalance suggests that zero bias should be the
constraint. A constraint requires fixation at a specific value,
moreover, and no such value for validity (other than one, which 
is all but impossible) corresponds to the value of zero bias. 

Test developers or test users should thus attempt to maximize 
validity under the constraint of zero bias. 

Since a correction for bias exists, this constrained maxi- 
mization is easier for the test user than the test developer. 

The test user need only adjust the predictor by use of equations 
(6) and (7). Though easier, however, this may not be the generally
better solution to the problem. The adjustment will typically
favor black applicants by a substantial amount (see Table
1). Argument that the unadjusted scores favor white applicants
by the same amount is not likely to preclude charges of reverse
discrimination. The better solution politically,
though not practically, may thus be for the test developer to 

attempt to make zero-biased tests with validities as high as 
possible. The test user would then have the politically accept- 
able role of choosing among two or more tests the one having 
maximal validity and minimal bias. 

Toward Bias-free Selection 

Ruling out the use of multiple cutting scores or differen- 
tial score adjustment, a test user may thus be left with the 
option of trying to choose among tests that differ by varying 
amounts in bias and validity. The choice between two tests is 
not difficult if both $r_{PC}$ is closer to one and $r_{R(P \cdot C)}$ is
closer to zero for one than for the other. Difficulty arises
when for one test $r_{PC}$ and for the other $r_{R(P \cdot C)}$ has the more
favorable value. In this case, the choice depends on how the
difference between the two $r_{PC}$ values compares with the difference
between the two $r_{R(P \cdot C)}$ values. If one difference is
substantially larger than the other, then the test user can base
his choice solely on the larger difference. Otherwise, consider- 
ations additional to bias and validity must determine the choice. 
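
The choice rule just described can be sketched in Python under stated assumptions: the bias index is taken to be the semipartial correlation between race and the part of the predictor uncorrelated with the criterion, and "substantially larger" is operationalized by a hypothetical ratio threshold, since the report gives no numerical value. The two candidate tests are invented for illustration.

```python
import math
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    r_pc: float   # validity
    r_rp: float   # race-predictor correlation
    r_rc: float   # race-criterion correlation


def bias(t: Candidate) -> float:
    """Semipartial correlation of race with P after removing C (assumed index)."""
    return (t.r_rp - t.r_rc * t.r_pc) / math.sqrt(1.0 - t.r_pc ** 2)


def choose(a: Candidate, b: Candidate, ratio: float = 3.0) -> Optional[str]:
    """Pick by dominance, else by the substantially larger difference.

    ratio is a hypothetical threshold; None means other considerations
    must determine the choice.
    """
    d_val = a.r_pc - b.r_pc                  # > 0 favors a
    d_bias = abs(bias(b)) - abs(bias(a))     # > 0 favors a
    if d_val >= 0 and d_bias >= 0:
        return a.name
    if d_val <= 0 and d_bias <= 0:
        return b.name
    if abs(d_bias) >= ratio * abs(d_val):
        return a.name if d_bias > 0 else b.name
    if abs(d_val) >= ratio * abs(d_bias):
        return a.name if d_val > 0 else b.name
    return None


# Hypothetical candidates: A is slightly more valid, B is far less biased
cand_a = Candidate("A", r_pc=0.41, r_rp=0.40, r_rc=0.16)
cand_b = Candidate("B", r_pc=0.39, r_rp=0.10, r_rc=0.16)
print(choose(cand_a, cand_b))   # B
```

Here the validity difference is .02 while the bias difference is roughly .33, so the larger difference decides the choice in favor of the less biased test.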

The availability of more than two tests to choose from will 
generally facilitate the choice. Whether easy or difficult, 
however, choosing a test with minimal bias may be only a trivial 
step toward bias-free selection if even the minimal bias is far 
from zero. The real challenge is thus not test choice but test
development . 

The basic units of a test are, of course, its items. A test 
ought to be bias-free, therefore, to the extent that its items 
are bias-free. Bias-free items are not, of course, items whose 
difficulties are equal for the two racial groups. A test composed
entirely of such items would have an $r_{RP}$ value of zero,
lower than $r_{RP}$ values typical of bias-free tests (see Tables
1 and 3) and thus indicative of counterbias. Freedom from bias
must take the criterion into account. Using a definition of 
item bias like Darlington 3, therefore, Scheuneman (1979) devel- 
oped a method of item analysis to produce bias-free tests. 
According to Darlington 3, a test is bias-free if in each sub- 
population of individuals having the same criterion score the 
mean predictor scores are equal for the two racial groups;
according to Scheuneman's definition, an item is bias-free if
in each subpopulation of individuals having the same test score
the item difficulties are equal for the two racial groups.
Although the Darlington 3 definition extends naturally from tests
to items, the Cleary definition does not. Extended to items,
the Cleary definition makes no sense: An item would be bias-free
if in each subpopulation of individuals having the same item
score (correct or incorrect) the mean test scores were equal for
the two racial groups! Actually, the Scheuneman counterpart of
Darlington 3 involves complications because each item contributes 
not only to the total test bias but also to the total test score. 

The removal or addition of items to reduce test bias may thus 
alter the Scheuneman bias of the remaining items. Better than 
partialing out the total test score from the item-race correla- 
tion would be partialing out the criterion from this correlation. 
Criterion availability ought to be no problem for tests used in 
selection, and for the development of these tests Scheuneman's 
work provides a useful guide. 
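
Partialing the criterion out of the item-race correlation can be sketched as below. This is only an illustration of the first-order partial correlation formula, not Scheuneman's procedure; the data for eight examinees are hypothetical, and the function names are assumptions.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y with z partialed out of both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data for eight examinees (not from the report)
race = [0, 0, 0, 0, 1, 1, 1, 1]                        # group code
item = [0, 0, 1, 1, 0, 1, 1, 1]                        # 1 = item answered correctly
criterion = [1.0, 2.0, 2.5, 3.0, 1.5, 2.5, 3.0, 3.5]   # criterion score

item_race = pearson(item, race)
item_race_given_criterion = partial_corr(item, race, criterion)
print(round(item_race, 3), round(item_race_given_criterion, 3))
```

An item would count as bias-free in this sense when the second value is close to zero, whatever the size of the first.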

Concluding Remarks 

Defining test bias consistently with the 1971 United States 
Supreme Court ruling regarding discrimination in selection, that 
individual differences having no effect on criterion performance 
should also have no effect on predictor performance, this report 
has demonstrated that considerable test bias, so defined, may result 
from attempts to maximize predictive validity to values attenuated 
only slightly by score adjustments that eliminate the bias. In a
previous use of score adjustments, Hunter, Schmidt, and Rauschenberger
(1977) compared the trade-off effects of satisfying Cleary
and other standards of fair test use, including Darlington 3, on
black selection ratios and mean criterion performance of all accepted
applicants: The ratios tended generally to be much higher and
the performance only somewhat lower for Darlington 3 than for
Cleary over the 0-1 range of post-adjustment total-group validities.
Invalidating their results in numerical detail, though not overall
contour, Hunter et al. mistakenly considered these validities to be
equal within-group predictor-criterion correlations (pp. 249, 257),
unaffected by score adjustments. Taking $(\bar{C}_W - \bar{C}_B)$ generally to
be .5 and finding the corresponding $(\bar{P}_W^* - \bar{P}_B^*)$ to satisfy each
condition of fair test use, they further considered the difference
between this and its observed counterpart $(\bar{P}_W - \bar{P}_B)$ to be the
score adjustment for the condition. Since $r_{RP^*}$ and $r_{RC}$ are
point-biserial correlations (see Equation (9)), particularly, the
Darlington 3 condition $r_{RP^*} = r_{P^*C} r_{RC}$ implies for standardized
P* and C that $(\bar{P}_W^* - \bar{P}_B^*) = r_{P^*C} (\bar{C}_W - \bar{C}_B)$. This is exactly the
form of the corresponding Hunter et al. equation (p. 257), presented
by them without derivation; only the interpretation of the
correlation differs, seen here clearly to be the post-adjustment
total-group validity ($r_{P^*C}$). As different functions of
post-adjustment total-group validities, the score adjustments used by
Hunter et al. in the comparisons involving Darlington 3 and Cleary
are not determinable from available data, and indeed their use to
compare the different methods failed to show for each method the
total-group validity changes from unknown though differing
pre-adjustment values to predetermined common post-adjustment
values. In any event, the comparisons reported by Hunter et al.,
despite their numerical errors, generally support the position
taken here that little, if any, justification exists for predictors
of criterion performance, regardless of their validity, to distinguish
among groups of individuals who would tend to perform equally
well on the criterion if given the chance.

This position unequivocally refutes the assertion by Cole
(1981) that "questions of bias are fundamentally questions of
validity." Appearing in a special issue of the American Psychologist
on testing, Cole's assertion is an authoritative affirmation of the
Cleary-McNemar position on test bias. Predictive validity is a
compelling concept. Selection error sits opposite predictive validity
on a seesaw; as one goes down, the other goes up. The temptation
is thus strong to extend the concept of predictive validity,
as Cole did, to include other desirable test attributes, as well.
Understanding this temptation may perhaps provide some fortification
to resist it.



Reference Notes 






1. Chapter 1217 (1978) , Postsecondary Education — Standardized 
Tests. Law added as Chapter 3 to Part 65 of the California 
Education Code. 

2. Chapter 672 (1979), Truth in Testing. Law added as Article 
7A to the New York State Education Law. 

3. Gulliksen, H. When high validity may indicate a faulty criterion
(RM 76-10). Princeton, N.J.: Educational Testing
Service, 1976.

4. Temp, G. Validity of the SAT for blacks and whites in thirteen
integrated institutions (RB-71-2). Princeton, N.J.:
Educational Testing Service, 1971.

5. Test use and validity. Princeton, N.J.: Educational Testing
Service, 1980.



References

Cleary, T. A. Test bias: Prediction of grades of Negro and
white students in integrated colleges. Journal of Educational
Measurement, 1968, 5, 115-124.

Cole, N. S. Bias in selection. Journal of Educational Measurement,
1973, 10, 237-255.

Cole, N. S. Bias in testing. American Psychologist, 1981, 36,
1067-1077.

Darlington, R. B. Another look at "cultural fairness." Journal
of Educational Measurement, 1971, 8, 71-82.

Darlington, R. B. A defense of "rational" personnel selection,
and two new methods. Journal of Educational Measurement,
1976, 13, 43-52.

Einhorn, H. J., & Bass, A. R. Methodological considerations
relevant to discrimination in employment testing. Psychological
Bulletin, 1971, 75, 261-269.

Flaugher, R. L. The many definitions of test bias. American
Psychologist, 1978, 33, 671-679.

Green, D. R. What does it mean to say a test is unbiased? Education
and Urban Society, 1975, 8, 33-52.

Gross, A. L., & Su, W. Defining a "fair" or "unbiased" selection
model: A question of utilities. Journal of Applied
Psychology, 1975, 60, 345-351.

Guion, R. M. Employment tests and discriminatory hiring.
Industrial Relations, 1966, 5(2), 20-37.

Hunter, J. E., & Schmidt, F. L. Critical analysis of the statistical
and ethical implications of various definitions of
test bias. Psychological Bulletin, 1976, 83, 1053-1071.

Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. M. Fairness
of psychological tests: Implications of four definitions for
selection utility and minority hiring. Journal of Applied
Psychology, 1977, 62, 245-260.

Jensen, A. R. Bias in mental testing. New York: The Free
Press, 1980.

Kaufmann, W. Without guilt and justice. New York: Wyden, 1973.

Linn, R. L. Fair test use in selection. Review of Educational
Research, 1973, 43, 139-161.

McNemar, Q. On so-called test bias. American Psychologist, 1975,
30, 848-851.

Marston, A. R. It is time to reconsider the Graduate Record
Examination. American Psychologist, 1971, 26, 653-655.

Petersen, N. S., & Novick, M. R. An evaluation of some models
for culture-fair selection. Journal of Educational Measurement,
1976, 13, 3-29.

Scheuneman, J. A method of assessing bias in test items. Journal
of Educational Measurement, 1979, 16, 143-152.

Schmidt, F. L., & Hunter, J. E. Racial and ethnic bias in psychological
tests. American Psychologist, 1974, 29, 1-8.

Thorndike, R. L. Concepts of culture-fairness. Journal of Educational
Measurement, 1971, 8, 63-70.

Weitzman, R. A. It is time to re-reconsider the GRE: A reply
to Marston. American Psychologist, 1972, 27, 236-238.



Appendix A

This appendix derives an adjustment to the predictor score
to make $r_{RP \cdot C}$, the partial correlation between race (R) and
the predictor (P) controlling for the criterion (C), equal to zero.
Constituting only a single condition, the equality
$r_{RP^* \cdot C} = 0$ can determine only a single constant, $\beta$, in the
multiple predictor $P^* = Z_P + \beta Z_R$, where the Z's denote standardized
measurements for the subscript variables. Since the numerator
of $r_{RP^* \cdot C}$ is $r_{RP^*} - r_{RC} r_{P^*C}$, the strategy followed is to
equate $r_{RP^*}$ to $r_{RC} r_{P^*C}$ and solve for $\beta$. Because both
correlations share the divisor $S_{P^*}$, the standard deviation of P*,
this divisor cancels:

$$ (1/N) \sum Z_R (Z_P + \beta Z_R) = r_{RC} (1/N) \sum Z_C (Z_P + \beta Z_R) \tag{A1} $$

or, on simplification and substitution separately on each side,

$$ r_{RP} + \beta = r_{RC} r_{PC} + \beta r_{RC}^2 , \tag{A2} $$

with solution for $\beta$ yielding

$$ \beta = -\frac{r_{RP} - r_{RC} r_{PC}}{1 - r_{RC}^2} . \tag{A3} $$



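Appendix B

This appendix develops a formula for the validity of P*, $r_{P^*C}$.
The strategy used is direct simplification and substitution
(see Appendix A for notation):

$$ r_{P^*C} = \frac{(1/N) \sum Z_C (Z_P + \beta Z_R)}{S_{P^*}} \tag{B1} $$

$$ = \frac{r_{PC} + \beta r_{RC}}{S_{P^*}} , \tag{B2} $$

where $S_{P^*}$ is the standard deviation of P*,

$$ S_{P^*} = \sqrt{(1/N) \sum (Z_P + \beta Z_R)^2} \tag{B3} $$

$$ = \sqrt{1 + 2\beta r_{RP} + \beta^2} , \tag{B4} $$

so that

$$ r_{P^*C} = \frac{r_{PC} + \beta r_{RC}}{\sqrt{1 + 2\beta r_{RP} + \beta^2}} . \tag{B5} $$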
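
The results of both appendices can be checked numerically on synthetic data. The sketch below is only an illustration under stated assumptions: the data-generating coefficients, the 0/1 race coding, the sample size, and the random seed are all hypothetical, and standard deviations use the divisor N as in the appendices. With β computed from the sample correlations, the numerator of the partial correlation between race and P* vanishes, and the directly computed validity of P* agrees with the closed-form expression.

```python
import math
import random


def standardize(x):
    """Z scores using the divisor N, as in the appendices."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return [(v - m) / s for v in x]


def corr(x, y):
    """Correlation as the mean product of Z scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


random.seed(1)
n = 500
# Hypothetical coding and coefficients: 1 = nonblack, 0 = black
race = [1.0 if random.random() < 0.888 else 0.0 for _ in range(n)]
crit = [0.5 * r + random.gauss(0, 1) for r in race]
pred = [0.4 * c + 1.0 * r + random.gauss(0, 1) for c, r in zip(crit, race)]

z_p, z_r = standardize(pred), standardize(race)
r_rp, r_rc, r_pc = corr(race, pred), corr(race, crit), corr(pred, crit)
beta = -(r_rp - r_rc * r_pc) / (1 - r_rc ** 2)               # adjustment weight
p_star = [a + beta * b for a, b in zip(z_p, z_r)]            # P* = Z_P + beta * Z_R

numerator = corr(race, p_star) - r_rc * corr(p_star, crit)   # numerator of r_RP*.C
formula = (r_pc + beta * r_rc) / math.sqrt(1 + 2 * beta * r_rp + beta ** 2)

print(abs(numerator) < 1e-9, abs(corr(p_star, crit) - formula) < 1e-9)   # True True
```

Both identities are algebraic, so they hold for any nondegenerate sample and not only for this seed.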



DISTRIBUTION LIST 



                                        Number of Copies

Defense Documentation Center                   2
Cameron Station, Building T
5010 Duke Street
Alexandria, VA 22314

Dean of Research                               1
Code 012
Naval Postgraduate School
Monterey, CA 93940

Library (Code 0142)                            2
Naval Postgraduate School
Monterey, CA 93940

Professor B. Bloxom                            1
Professor R. Elster                            1
Professor M. Eitelberg                         1
Professor W. McGarvey                          1
Professor T. Swenson                           1
Professor G. Thomas                            1
Professor R. Weitzman                         20
Code 54
Naval Postgraduate School
Monterey, CA 93940







